A visual context-aware multimodal system for spoken language processing

Authors

  • Niloy Mukherjee
  • Deb Roy
Abstract

Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal interactions: during the interpretation of speech, visual context appears to steer speech processing, and vice versa. Motivated by these findings, we present a real-time multimodal system that performs early integration of visual contextual information to recognize the most likely word sequences in spoken utterances. The system first acquires a grammar and a visually grounded lexicon through a "show-and-tell" procedure in which the training input consists of camera images of sets of objects paired with verbal object descriptions. Given a new scene, the system generates a dynamic visually grounded language model and drives a dynamic model of visual attention to steer speech recognition search paths toward more likely word sequences.
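The core idea of a visually grounded language model can be sketched as re-weighting word probabilities toward words whose groundings match objects in the current scene. The following is a minimal illustrative sketch, not the authors' implementation; the lexicon, object names, and `boost` parameter are all hypothetical assumptions.

```python
# Hypothetical visually grounded lexicon: each word maps to the set of
# object identities it can refer to (empty set = no grounding here).
GROUNDED_LEXICON = {
    "red": {"red_ball", "red_cup"},
    "blue": {"blue_cup"},
    "ball": {"red_ball"},
    "cup": {"red_cup", "blue_cup"},
    "dog": set(),  # word with no referent in this visual domain
}

def grounded_language_model(scene_objects, boost=4.0):
    """Return a unigram distribution biased toward words whose visual
    groundings intersect the objects currently visible in the scene."""
    weights = {}
    for word, groundings in GROUNDED_LEXICON.items():
        w = 1.0
        if groundings & scene_objects:  # word refers to a visible object
            w *= boost
        weights[word] = w
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

# Given a new scene, words grounded in visible objects become more
# probable, steering the recognizer's search toward them.
scene = {"red_ball", "blue_cup"}
lm = grounded_language_model(scene)
assert lm["ball"] > lm["dog"]
assert lm["blue"] > lm["dog"]
```

In a full system this biased distribution would condition the recognizer's search at decode time rather than being applied as a static unigram table, but the sketch captures the early-integration principle described in the abstract.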


Similar articles

The multimodal nature of spoken word processing in the visual world: Testing the predictions of alternative models of multimodal integration

Ambiguity in natural language is ubiquitous (Piantadosi, Tily & Gibson, 2012), yet spoken communication is effective due to integration of information carried in the speech signal with information available in the surrounding multimodal landscape. However, current cognitive models of spoken word recognition and comprehension are underspecified with respect to when and how multimodal information...


Spontaneous Speech Recognition Using Visual Context-Aware Language Models

The thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes a...


A comprehensive model of spoken word recognition must be multimodal: Evidence from studies of language mediated visual attention

When processing language, the cognitive system has access to information from a range of modalities (e.g. auditory, visual) to support language processing. Language mediated visual attention studies have shown sensitivity of the listener to phonological, visual, and semantic similarity when processing a word. In a computational model of language mediated visual attention, that models spoken wor...


Language as a multimodal phenomenon: implications for language learning, processing and evolution.

Our understanding of the cognitive and neural underpinnings of language has traditionally been firmly based on spoken Indo-European languages and on language studied as speech or text. However, in face-to-face communication, language is multimodal: speech signals are invariably accompanied by visual information on the face and in manual gestures, and sign languages deploy multiple channels (han...


Making Relative Sense: From Word-Graphs To Semantic Frames

Scaling up from controlled single-domain spoken dialogue systems toward conversational, multi-domain and multimodal dialogue systems poses new challenges for the reliable processing of less restricted user utterances. In this paper we explore the feasibility of employing a general-purpose ontology for various tasks involved in processing the user's utterances.



Publication date: 2003